# Amplifying Effective CXL Memory Bandwidth for LLM Inference via Transparent Near-Data Processing

Rui Xie<sup>1</sup>, Asad Ul Haq<sup>1</sup>, Linsen Ma<sup>1</sup>, Yunhua Fang<sup>1</sup>, Zirak Burzin Engineer<sup>2</sup>, Liu Liu<sup>1</sup>, and Tong Zhang<sup>1</sup> <sup>1</sup>Rensselaer Polytechnic Institute, Troy, NY, USA <sup>2</sup>Wiseburn Da Vinci Science, El Segundo, CA, USA

Abstract-Large language model (LLM) inference requires both high bandwidth and massive memory capacity, yet HBM alone cannot provide sufficient capacity to hold full model weights and expanding KV caches, especially for long-context serving. This necessitates model and KV cache offloading over a heterogeneous memory hierarchy, for which a CXL-based memory pool emerges as a promising candidate to provide a cost-effective and elastic tier. The value of such a tier ultimately depends on its effective bandwidth, as both the PCIe link and the DDR-class channels inside CXL memory devices offer far lower bandwidth than host-side HBM. To address this limitation, we introduce CXL-NDP, a transparent near-data processing architecture for CXL memory devices that amplifies effective bandwidth without modifying the standard CXL.mem interface. CXL-NDP improves efficiency in two ways: (i) a precision-scalable bit-plane layout that enables CXL memory devices to gracefully support dynamic model and KV cache quantization while sustaining nearly full effective bandwidth utilization, and (ii) transparent lossless compression of model weights and KV cache within the CXL memory device, improving effective intra-CXL DRAM bandwidth and paving the way for richer in-device computation in the future. Evaluations show that CXL-NDP reduces memory footprint by 25.2% for weights and 46.9% for KV cache without accuracy loss. DRAMSim3-based simulations show that it lowers DRAM access energy by up to 40.3% and reduces model-load latency by up to 42.1%. In an end-to-end serving evaluation, CXL-NDP improves inference throughput by 43% and extends the maximum context length by 87% before exhausting memory. RTL synthesis in a 7 nm process shows that a compression engine provisioned for up to 2 TB/s throughput requires only about 5.7 mm<sup>2</sup> of silicon area, demonstrating that high-speed compression can be integrated with modest cost.
By placing intelligence entirely inside the CXL memory device and preserving the standard CXL.mem interface, CXL-NDP requires no modification to AI models or applications, significantly lowering the barrier for real adoption while offering a scalable path for future generative AI infrastructure.

Index Terms-Large language model, CXL, lossless compression, quantization, pruning, efficient AI.

## I. INTRODUCTION

Large language models (LLMs) dominate today's AI serving workloads, and their memory demand grows with both model size and sequence length. Model weights consume hundreds of gigabytes, while the key-value (KV) cache can quickly exceed weights as context length increases. This dual demand for bandwidth and capacity makes the memory system, rather than compute, the critical bottleneck in LLM inference. High-bandwidth memory (HBM) is indispensable as the bandwidth tier. Modern HBM3E stacks deliver terabytes per second per package, making them well suited to sustain AI accelerator throughput. However, HBM has two fundamental limitations. First, capacity is tightly constrained by packaging and thermal limits, leaving a single GPU with at most a few hundred gigabytes [1]. Second, HBM costs at least three times more per GB than DDR5 [2], making it economically infeasible to provision enough HBM for full models and large KV caches. As a result, HBM alone cannot satisfy the memory needs of generative AI.

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

This motivates heterogeneous memory hierarchies in which HBM remains the bandwidth tier while a lower-cost pool provides elastic capacity. CXL-based memory has emerged as a promising candidate. CXL allows commodity DDR-class memory to be attached as an expandable tier with load/store semantics. It can be flexibly pooled across hosts to reduce stranded capacity and, in modeled deployments, lowers effective cost per GB by 50-55% [3]. These advantages make CXL memory attractive for offloading model weights and KV caches that do not fit into on-device HBM. The value of such a tier in the context of AI inference ultimately depends on its effective bandwidth. PCIe/CXL links are improving (for example, PCIe 7.0 x16 targets 256 GB/s per direction [4]), but this is still an order of magnitude below the multi-terabyte/s HBM bandwidth of a single accelerator package. On the device side, DDR-class channels behind CXL are even more limited, with DDR5-6400 offering only 51.2 GB/s per channel and DDR6-12800 projected to peak at 102.4 GB/s [5]. Consequently, a naive use of CXL memory risks stalling inference throughput on link and DRAM bottlenecks, undermining the cost benefits of memory pooling.

Prior work has primarily focused on application- and software-level optimizations, such as sparsity, KV eviction, routing, and static or dynamic quantization, to reduce bandwidth demand [6]-[9]. While valuable, these methods leave the intrinsic bandwidth limitations of CXL memory devices unchanged. Our approach is orthogonal, as we target the memory controller itself to amplify effective bandwidth in a way that complements software-level techniques. In this work, we present CXL-NDP, a transparent near-data processing architecture for CXL memory devices that amplifies effective bandwidth without modifying the CXL.mem protocol or requiring changes to AI applications. CXL-NDP introduces lightweight intelligence into the memory controller through two synergistic techniques. First, a precision-scalable bit-plane layout reorganizes floating-point values into bit-planes and exposes precision-partitioned logical regions, enabling CXL devices to support dynamic quantization of model weights and KV caches without incurring memory space overhead and while maintaining full bandwidth utilization. Second, transparent lossless compression leverages the bit-plane layout from the first technique and augments it with additional schemes, including channel-wise KV reordering and exponent-delta encoding. Together, these components expose statistical regularity in both weights and KV cache, allowing general-purpose engines (LZ4/ZSTD [10], [11]) to achieve higher compression ratios and significantly amplify intra-CXL DRAM effective bandwidth. Quantization support and compression reinforce each other, yielding a stronger amplification of effective bandwidth than either achieves alone.

Our evaluation shows that CXL-NDP delivers substantial benefits. Across public LLMs, it reduces memory footprint by 25.2% for weights and 46.9% for KV cache, with up to 2.69× compression in some layers, all without accuracy loss. DRAMSim3-based simulations demonstrate that precision-proportional fetch lowers DRAM access energy by up to 40.3% and reduces model-load latency by up to 42.1%. In an end-to-end evaluation serving a 70B parameter LLM on a single GPU, CXL-NDP boosts inference throughput by 43% at long context lengths and extends the maximum usable context by 87% compared to a conventional CXL memory system. RTL synthesis in a 7 nm process shows that a compression engine provisioned for multi-terabit throughput requires only 5.7 mm<sup>2</sup> of silicon area, confirming that high-speed near-data compression can be integrated at modest cost.

In summary, this paper makes the following contributions:

1) **Transparent near-data processing for CXL memory.** We propose CXL-NDP, the first architecture that amplifies effective bandwidth inside CXL memory devices without requiring changes to CXL.mem or to AI applications.
2) **Transparent precision-scalable bit-plane layout.** A device-side mechanism that exposes a logical multi-precision memory space, enabling CXL memory to support dynamic quantization of model weights and KV caches transparently under unmodified CXL.mem, while maintaining near-full effective bandwidth utilization and easing system integration.
3) **Transparent lossless compression.** A controller design that applies high-speed lossless compression to model weights and KV caches, reducing intra-device DRAM traffic and laying the foundation for future in-device computation.
4) **Comprehensive evaluation.** Quantitative results on LLaMA, OPT, and Mixtral showing up to 2.69× per-layer KV compression, 25.2% weight reduction, 40.3% lower DRAM energy, and 42.1% lower latency.
5) **Hardware feasibility.** RTL synthesis in 7 nm demonstrating that multi-terabit/s compression engines can be integrated in $\leq$ 5.7 mm<sup>2</sup>, validating the practicality of CXL-NDP.

## II. BACKGROUND AND MOTIVATION

## A. CXL as a Memory Building Block

Compute Express Link (CXL), through its CXL.mem protocol, turns DDR-class memory behind a CXL Type-3 device into byte-addressable capacity that the OS can map and manage like regular memory [12]. In practice, accelerators keep the hot working set in HBM and place bulk weights and KV overflow in CXL memory. Recent studies show this increases model concurrency with small impact on token latency when placement and migration are done properly [3], [13].

The limiting factor is effective bandwidth inside the device. A single HBM3E stack delivers roughly 1 TB/s-class bandwidth, whereas the host link and the device-side DRAM sit far lower: PCIe 7.0 x16 targets 256 GB/s per direction, and DDR-class channels provide 51.2 GB/s at DDR5-6400 and 102.4 GB/s at DDR6-12800 [2], [4], [5]. Any wasted bytes on the device DRAM bus (reading full words when only a few bits are needed, poor row locality, or low-compressibility layouts) show up directly as performance degradation on the critical path.
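The peak figures quoted above follow from simple arithmetic. The sketch below reproduces them under two assumptions that are not stated in the text: a 64-bit (8-byte) data bus per DDR channel, and a PCIe 7.0 signaling rate of 128 GT/s per lane with encoding and protocol overhead ignored. The function names are illustrative.

```python
def ddr_channel_gbps(mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Peak bandwidth of one DDR channel in GB/s (decimal units)."""
    return mega_transfers_per_s * 1e6 * bus_bytes / 1e9


def pcie_link_gbps(giga_transfers_per_s: int, lanes: int) -> float:
    """Raw per-direction PCIe bandwidth in GB/s (one bit per transfer per lane)."""
    return giga_transfers_per_s * lanes / 8


ddr5 = ddr_channel_gbps(6400)     # DDR5-6400
ddr6 = ddr_channel_gbps(12800)    # DDR6-12800 (projected), same bus-width assumption
pcie7 = pcie_link_gbps(128, 16)   # PCIe 7.0 x16

print(f"DDR5-6400: {ddr5:.1f} GB/s, DDR6-12800: {ddr6:.1f} GB/s, PCIe 7.0 x16: {pcie7:.0f} GB/s")
print(f"DDR5-6400 channels needed to match the link: {pcie7 / ddr5:.0f}")
```

Under these assumptions a device needs several DDR5 channels running at full efficiency just to keep the host link busy, which is why wasted bytes on the device DRAM bus matter.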

Software integration is straightforward and does not require new APIs. Tiering policies can use standard NUMA placement, DAMON-style hot/cold tracking, or runtime allocators. Pooling across hosts is handled by a CXL switch and fabric manager, independent of the accelerator stack. What is missing today is a device-side mechanism that raises *effective* intra-CXL DRAM bandwidth under the unmodified CXL.mem interface. That is the focus of this work.

## B. Memory Challenges in LLM Inference

LLM inference is memory-bound. Each decoding step runs the new token through all layers while attending over the full prefix held in a growing key-value (KV) cache. For large models (e.g., LLaMA 3.1 405B [14], DeepSeek R1 671B [15]), weights alone consume massive capacity (about 750 GB for LLaMA 3.1 405B and 1543 GB for DeepSeek R1 671B). The KV cache then becomes the dominant consumer as sequence length grows. In LLaMA 3.1 8B, once the context extends to a few thousand tokens, KV exceeds 90% of the total memory footprint, as shown in Fig. 1. This is a capacity problem first: without substantial DRAM or HBM provisioning, systems will swap, thrash, or fail with out-of-memory errors.

It is also a bandwidth problem [6], [16], [17]. Autoregressive decoding requires each new token to read and compute against all model weights layer by layer. In parallel, attention fetches (and updates) the KV cache each step. Together, weights plus historical context impose large, repeated read traffic on the memory subsystem. Even with ample compute, throughput stalls if memory cannot serve these requests quickly. Cutting memory traffic is therefore as important as shrinking raw model size, because every partial stall stretches token generation latency.



Fig. 1. Percentage contribution of KV cache and model weights to total memory footprint with increasing sequence length in LLaMA 3.1 8B.
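The capacity trend in Fig. 1 can be approximated with a back-of-the-envelope model. The sketch below is illustrative only: the model shape (32 layers, 8 KV heads, head dimension 128, BF16 values) is an assumed configuration for an 8B-class model with grouped-query attention, and `cached_tokens` counts tokens across all concurrent sequences because the text does not state a batch size.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of K plus V appended per token, summed over all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_share(weight_bytes, cached_tokens, per_token_bytes):
    """Fraction of the (weights + KV) footprint taken by the KV cache."""
    kv = cached_tokens * per_token_bytes
    return kv / (kv + weight_bytes)


# Assumed shape: 32 layers, 8 KV heads, head_dim 128, BF16 values.
PER_TOKEN = kv_bytes_per_token(32, 8, 128)   # 128 KiB per cached token
WEIGHTS = 8e9 * 2                            # roughly 8B parameters stored in BF16

# cached_tokens = batch size x context length (both assumptions of this sketch).
for cached_tokens in (2**16, 2**18, 2**20):
    share = kv_share(WEIGHTS, cached_tokens, PER_TOKEN)
    print(f"{cached_tokens:>8} cached tokens -> KV share {share:.1%}")
```

The KV share grows monotonically with the number of cached tokens, so the crossover point depends on how many sequences are served concurrently as well as on context length.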

## C. Dynamic Quantization

Dynamic quantization adapts precision to context. It keeps high-importance data at higher precision (e.g., BF16/FP16) and reduces precision for less critical data (e.g., FP8/FP4). This improves compute efficiency and can lower memory traffic if the memory subsystem serves only the needed bits.

1) Dynamic quantization on model weights: Dynamic quantization on weights works best when we align precision to natural chunks in the model. We use two granularities: (i) per-expert with a modified Mixture-of-Depths-and-Experts (MoDE) [7], and (ii) per attention head and per MLP neuron with OPT [18].

**Granularity: Per-expert (MoDE).** We add precision control to MoDE with two routers (Fig. 2). Router 1 decides whether a block runs and sets its max precision using thresholds on score r (e.g.,  $r > 0.8 \rightarrow BF16$ ,  $0.6 < r \le 0.8 \rightarrow FP12$ ,  $0.4 < r \le 0.6 \rightarrow FP8$ ,  $0.2 < r \le 0.4 \rightarrow FP4$ , else skip). Router 2 assigns per-expert precision under that cap; low-score experts can be pruned (FP0). Shapes and capacities are unchanged, so kernels stay the same. On LLaMA-MoE-3.5B [19], lowering precision on more experts instead of skipping them improves zero-shot PIQA by +1.9 points while reducing effective compute and memory traffic, as shown in Fig. 3.
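The two-router policy can be sketched as follows. The block-level thresholds are the example values given above; the per-expert rule is not specified in the text, so `expert_precision` uses a hypothetical clamp in which each expert requests a format and the block cap bounds it.

```python
from typing import Optional

# Example thresholds from the text; actual values are tunable.
BLOCK_TIERS = [(0.8, "BF16"), (0.6, "FP12"), (0.4, "FP8"), (0.2, "FP4")]
BITS = {"BF16": 16, "FP12": 12, "FP8": 8, "FP4": 4, "FP0": 0}


def block_precision_cap(r: float) -> Optional[str]:
    """Router 1: map block score r to a maximum precision, or None to skip the block."""
    for threshold, fmt in BLOCK_TIERS:
        if r > threshold:
            return fmt
    return None


def expert_precision(requested: str, cap: Optional[str]) -> str:
    """Router 2 (hypothetical rule): clamp an expert's requested format to the block cap."""
    if cap is None:
        return "FP0"  # block skipped, so the expert is effectively pruned
    return requested if BITS[requested] <= BITS[cap] else cap


cap = block_precision_cap(0.55)                 # falls in the 0.4 < r <= 0.6 band
print(cap, expert_precision("BF16", cap), expert_precision("FP4", cap))
```

Because every expert keeps its shape and only its format changes, the same kernels can be launched regardless of the routing outcome.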



Fig. 2. Dynamic weight precision with MoDE: a block-level cap (Router 1) and per-expert assignments (Router 2).

Fig. 3. MoDE per-expert precision vs prune-only on LLaMA-MoE-3.5B for PIQA [20], WinoGrande [21], LAMBADA [22], and MMLU [23].

**Granularity: Per-head/per-neuron (OPT).** For OPT 1.3B/13B/30B [18] on C4 [24] and WikiText [25], we compare: (i) FP16 baseline, (ii) static-uniform precision (one format for all weights), and (iii) dynamic importance-aware precision per head and per neuron using a predictor [6]. The predictor emits a score in [0,1]; thresholds map the score to one of {FP16, FP12, FP8, FP6, FP4, FP0}, and we sweep the average bits/weight. Dynamic precision consistently beats static-uniform at the same bits and often matches higher-bit static-uniform with fewer bits. For example, as shown in Fig. 4, for OPT-13B on C4 at 8 bits, dynamic precision reduces perplexity from 16.39 (static) to 12.70; at 6.4 bits, dynamic reaches 13.89, still better than static 8-bit at 16.39.



Fig. 4. Perplexity vs average bits/weight on OPT. Dynamic per-head/per-neuron precision outperforms static-uniform at the same bits. Lower is better.

2) Dynamic quantization on KV cache: We quantize the KV cache at page granularity. A page is a fixed-size window of tokens for a given layer and head that contains many K and V vectors (all channels for those tokens). Page-level decisions are stable over short windows, cheap to enact at runtime, and match how attention reuses nearby context. Quest [9] provides a strong baseline. It ranks KV pages by runtime importance (e.g., recency and cumulative attention), keeps the top N pages in BF16, and prunes all remaining pages. On LLaMA 3.1 8B [14] with BookSum [26], keeping the top 5 pages in BF16 yields perplexity 12.49 (vs. 10.49 at full precision). We use the same ranking but assign multiple precision tiers to capture the long tail of importance. Two tiers (Top 5 pages BF16, Next 5 FP8) reach perplexity 11.60. A three-tier variant (Top 5 BF16, Next 3 FP8, Next 2 FP4) reaches 11.87. These results, summarized in Table I, show that page-wise, multi-tier precision preserves quality better than keeping only the top pages and pruning the rest, while keeping runtime control simple and coarse-grained.

TABLE I PERPLEXITY FOR VARIOUS QUANTIZATION METHODS ON LLAMA 3.1 8B MODEL AND BOOKSUM DATASET

| Method                                                             | Perplexity |
|--------------------------------------------------------------------|------------|
| Full KV Cache                                                      | 10.49      |
| Sliding Window (64 tokens)                                         | 14.33      |
| Quest (Top 5 pages in BF16)                                        | 12.49      |
| Dynamic Quant. (Top 5 pages in BF16, Next 3 in FP8, Next 2 in FP4) | 11.87      |
| Dynamic Quant. (Top 5 pages in BF16, Next 5 in FP8)                | 11.60      |
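The tiered policy in Table I amounts to ranking pages and walking down a list of (count, format) tiers. The following is a minimal sketch of that assignment; the importance scores are made-up numbers, and the ranking function used by Quest is not reproduced here.

```python
def assign_page_precision(scores, tiers):
    """Rank pages by importance and assign precision tiers.

    scores: per-page importance (higher means more important).
    tiers:  list of (count, format) in descending precision.
    Pages that fall beyond all tiers are pruned and marked "FP0".
    """
    order = sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)
    formats = ["FP0"] * len(scores)
    rank = 0
    for count, fmt in tiers:
        for page in order[rank:rank + count]:
            formats[page] = fmt
        rank += count
    return formats


THREE_TIER = [(5, "BF16"), (3, "FP8"), (2, "FP4")]   # Top 5 / Next 3 / Next 2
scores = [0.05, 0.9, 0.3, 0.7, 0.1, 0.6, 0.8, 0.2, 0.4, 0.5, 0.01, 0.02]
fmts = assign_page_precision(scores, THREE_TIER)
print(fmts)
```

With a precision-partitioned device, each resulting format would simply select which logical region the page is read from.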

## D. AI Accelerators Support Variable-Precision

Modern AI accelerators now ship with native variable-precision arithmetic, enabling dynamic quantization to scale compute throughput with precision. For example, NVIDIA Blackwell (B200) supports FP64/32/16/8/6/4 and INT8. Public specs for GB200 NVL72 list per-GB200 superchip (two GPUs) peaks of 10 PFLOPS (FP16/BF16), 20 PFLOPS (FP8/FP6), and 40 PFLOPS (FP4) with structured sparsity, which implies roughly 5/10/20 PFLOPS per B200 GPU [27]. System-level DGX B200 numbers (eight B200 GPUs) report 72 PFLOPS for training and 144 PFLOPS for inference [28], consistent with about 9 PFLOPS (FP8-class) and 18 PFLOPS (FP4-class) per GPU in dense mode, which matches partner summaries of $\approx 4.5/9/18$ PFLOPS at FP16/FP8/FP4. These figures confirm that compute scales near-linearly as precision drops inside the SMs; the open question is whether memory bandwidth and energy co-scale when quantization is applied inside the accelerator rather than in the memory system.

## E. Lossless Compression on LLM

Lossless compression cuts memory footprint without hurting accuracy in LLM inference, but generic codecs are weak on raw LLM tensors. Table II shows that LZ4 [10] often yields no savings on weights or KV, and ZSTD [11] is modest on weights (about 17–23%) and barely helps on KV (about 1–7%) across common models and BookSum.

TABLE II MODEL WEIGHTS AND KV CACHE FOOTPRINT REDUCTION UNDER LOSSLESS COMPRESSION

| Codec | LLaMA 3.1 8B | Gemma 2 2B | Mistral 7B | OPT 13B | Mixtral 8×7B |
|-------|--------------|------------|------------|---------|--------------|
| **Model Weights** | | | | | |
| LZ4   | 0.0%         | 11.5%      | 0.0%       | 0.0%    | 18.0%        |
| ZSTD  | 20.6%        | 23.0%      | 17.3%      | 19.4%   | 21.3%        |
| **KV Cache on BookSum Dataset** | | | | | |
| LZ4   | 0.0%         | 0.0%       | 0.0%       | 0.0%    | 0.0%         |
| ZSTD  | 6.5%         | 2.9%       | 0.9%       | 2.0%    | 3.8%         |

Two factors explain this. First, IEEE floats mix sign, exponent, and mantissa at the byte level. Byte-oriented codecs then see high entropy, especially when exponents vary, so repeated structure is hidden even when magnitudes are similar. Second, KV is written token by token. That layout reflects per-token variation and hides the correlation that exists across tokens along the channel dimension.

Prior work shows where the structure lies: channels correlate across adjacent tokens [29], [30]. For LLaMA 2, grouping KV by channel reduced relative reconstruction error from 13.67 (token grouping) to 4.55 [29]. This motivates a layout-first approach: reorganize floating-point fields to expose similarity, and store KV in channel-major windows with compact exponent metadata (for example, a base exponent per channel plus small deltas). With layouts that reflect how tensors vary, standard lossless codecs become much more effective while keeping values exact. The design section builds on these observations.

## III. PROPOSED DESIGN

CXL-NDP is a near-data controller inside a CXL Type-3 device that targets limited bandwidth and capacity. It raises effective bandwidth under an unmodified CXL.mem interface by (i) serving only the bit-planes required by the runtime precision and (ii) losslessly compressing weights and KV to cut in-device bytes. CXL-NDP combines four mechanisms:

1) **Bit-plane layout with precision-proportional fetch.** We store sign, exponent, and mantissa as contiguous bit-planes. Reads from a precision-partitioned region fetch only the most significant planes needed for that format plus optional guard planes for rounding, then reassemble on-device. Plane blocks are row-aligned (2–8 KB uncompressed) to amortize activates and sustain long bursts.
2) **Channel-major KV reorganization.** On writes, the controller buffers a small window per head, transposes KV to channel-major across tokens, and then compresses per-plane. This keeps values exact, increases row locality, and improves compression. On reads, the controller reconstructs the native token-major layout.
3) **Precision-partitioned address space.** The device exposes disjoint logical regions, one per target precision. The host selects precision by reading the corresponding region. Internally all regions map to the same plane image; the controller performs plane selection and assembly transparently.
4) **On-device lossless block compression.** Per-plane blocks are compressed before DRAM commit (KV) and stored pre-compressed (weights). LZ4 is used for low latency, ZSTD for higher ratio, with raw bypass when compression is not profitable. Headers record plane id, sizes, codec, checksum, and guard flags.

Each DRAM access moves fewer bytes along two axes: fewer planes at lower runtime precision and fewer bytes per plane when compression is effective. The host issues normal CXL.mem loads. The controller handles plane selection, (de)compression, and reassembly at line rate, including a raw fast path for incompressible data.
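The raw fast path mentioned above can be illustrated with a small encoder that stores a block uncompressed whenever the codec does not shrink it. This is a sketch under assumptions: `zlib` from the Python standard library stands in for the LZ4/ZSTD engines, and the codec tags `RAW` and `DEFLATE` are hypothetical values, not the device's header encoding.

```python
import random
import zlib
from typing import Tuple

RAW, DEFLATE = 0, 1  # hypothetical codec tags (the text names LZ4 and ZSTD)


def encode_block(block: bytes) -> Tuple[int, bytes]:
    """Compress one plane block; fall back to raw storage when it does not shrink."""
    packed = zlib.compress(block, 1)
    if len(packed) < len(block):
        return DEFLATE, packed
    return RAW, block


def decode_block(codec: int, payload: bytes) -> bytes:
    """Inverse of encode_block; the raw path is a plain copy."""
    return zlib.decompress(payload) if codec == DEFLATE else payload


zero_plane = bytes(4096)                                       # e.g., a constant high-order plane
rng = random.Random(0)
noisy_plane = bytes(rng.getrandbits(8) for _ in range(4096))   # e.g., a low mantissa plane

for name, plane in (("zero", zero_plane), ("noisy", noisy_plane)):
    codec, payload = encode_block(plane)
    print(name, "codec:", codec, "stored bytes:", len(payload))
```

The bypass bounds the stored size at the raw size, so incompressible planes never cost more DRAM bytes than they would without compression (header overhead aside).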

## A. Bit-plane Disaggregation for Lossless Compression and Precision-Proportional Fetch

Fig. 5. Overview of proposed design: Mitigating memory bottlenecks by enhancing on-chip memory controller within AI accelerators.

**Bit-plane disaggregation.** Floating-point data (e.g., FP16) compress poorly when stored per number because each word mixes sign, exponent, and mantissa bytes, which hides regularity from byte-oriented compressors. We instead reorganize data by *bit-plane disaggregation*: store the same bit position of many values contiguously as shown in Fig. 6. This delivers two concrete benefits. First, it improves *lossless compression* because the sign plane, exponent planes, and the most significant mantissa planes exhibit lower entropy and compress well with LZ4 or ZSTD. Second, it enables *precision-proportional fetch*: for any requested reduced-precision format, the device reads only the sign plane and just the most significant exponent and mantissa planes needed for that format, cutting internal DRAM activations and burst transfers to match runtime precision. We detail the per-plane layout, selective retrieval, and optional rounding next.



Fig. 6. Bit-plane disaggregation: values are reorganized by bit position to expose compressible planes and enable precision-proportional reads.

Let the format have n=1+E+M bits: 1 sign, E exponent bits, and M mantissa bits. Consider a block of m values  $\{x_1,\ldots,x_m\}$ . For value j: (i)  $b_{j,0}^{\mathrm{sgn}}\in\{0,1\}$  is its sign bit. (ii)  $b_{j,i}^{\mathrm{exp}}$  is its exponent bit at position  $i\in[0,E-1]$  (we index i=0 as the least significant exponent bit and i=E-1 as the most significant). (iii)  $b_{j,i}^{\mathrm{man}}$  is its mantissa bit at position  $i\in[0,M-1]$  (similarly, i=0 is the least significant mantissa bit).

We form field-specific planes as m-bit rows, corresponding to the sign $(P^{\mathrm{sgn}})$, exponent $(P^{\mathrm{exp}}_i)$, and mantissa $(P^{\mathrm{man}}_i)$ fields: $P^{\mathrm{sgn}} = \{b^{\mathrm{sgn}}_{1,0}, \ldots, b^{\mathrm{sgn}}_{m,0}\}$, $P^{\mathrm{exp}}_i = \{b^{\mathrm{exp}}_{1,i}, \ldots, b^{\mathrm{exp}}_{m,i}\}$, $P^{\mathrm{man}}_i = \{b^{\mathrm{man}}_{1,i}, \ldots, b^{\mathrm{man}}_{m,i}\}$. Stacking planes yields the bit-plane matrix

$$\mathbf{P} = \begin{bmatrix} P^{\mathrm{sgn}} \\ P^{\mathrm{exp}}_{0} \\ \vdots \\ P^{\mathrm{exp}}_{E-1} \\ P^{\mathrm{man}}_{0} \\ \vdots \\ P^{\mathrm{man}}_{M-1} \end{bmatrix} = \begin{bmatrix} b^{\mathrm{sgn}}_{1,0} & \cdots & b^{\mathrm{sgn}}_{j^{*},0} & \cdots & b^{\mathrm{sgn}}_{m,0} \\ b^{\mathrm{exp}}_{1,0} & \cdots & b^{\mathrm{exp}}_{j^{*},0} & \cdots & b^{\mathrm{exp}}_{m,0} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ b^{\mathrm{exp}}_{1,E-1} & \cdots & b^{\mathrm{exp}}_{j^{*},E-1} & \cdots & b^{\mathrm{exp}}_{m,E-1} \\ b^{\mathrm{man}}_{1,0} & \cdots & b^{\mathrm{man}}_{j^{*},0} & \cdots & b^{\mathrm{man}}_{m,0} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ b^{\mathrm{man}}_{1,M-1} & \cdots & b^{\mathrm{man}}_{j^{*},M-1} & \cdots & b^{\mathrm{man}}_{m,M-1} \end{bmatrix} \tag{1}$$

Higher-order planes (sign, upper exponent planes, and the most significant mantissa planes) typically have lower entropy and compress better.
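A software model of the plane layout helps make this concrete. The sketch below packs FP16 words into 16 bit-planes and back, then compresses each plane independently. It is an illustration under assumptions: planes are emitted most-significant bit first (a row permutation of the stacking order in (1)), the input is synthetic Gaussian data rather than real model weights, and `zlib` stands in for LZ4/ZSTD.

```python
import random
import struct
import zlib


def to_bitplanes(words, nbits=16):
    """Return nbits planes, MSB first; plane p packs bit (nbits-1-p) of every word."""
    planes = []
    for bit in range(nbits - 1, -1, -1):
        plane = bytearray((len(words) + 7) // 8)
        for j, w in enumerate(words):
            if (w >> bit) & 1:
                plane[j // 8] |= 0x80 >> (j % 8)
        planes.append(bytes(plane))
    return planes


def from_bitplanes(planes, count):
    """Inverse of to_bitplanes: reassemble `count` words from the planes."""
    nbits = len(planes)
    words = [0] * count
    for p, plane in enumerate(planes):
        bit = nbits - 1 - p
        for j in range(count):
            if plane[j // 8] & (0x80 >> (j % 8)):
                words[j] |= 1 << bit
    return words


rng = random.Random(0)
values = [rng.gauss(0.0, 0.02) for _ in range(4096)]           # synthetic small-magnitude data
words = [struct.unpack("<H", struct.pack("<e", v))[0] for v in values]  # FP16 bit patterns

planes = to_bitplanes(words)
restored = from_bitplanes(planes, len(words))                   # lossless round trip
sizes = [len(zlib.compress(p)) for p in planes]                 # per-plane compressed bytes
print("round trip exact:", restored == words)
print("compressed bytes per plane (sign, exp MSB..LSB, mantissa MSB..LSB):", sizes)
```

For magnitudes below 2, the most significant FP16 exponent bit is always zero, so that plane collapses to a few bytes, while low mantissa planes stay close to their raw size. Real tensors will show different ratios than this synthetic input.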

**Per-plane lossless compression and placement.** CXL-NDP maintains per-plane buffers (typically 1–4 KB), compresses each plane independently with a lossless block compressor (e.g., LZ4/ZSTD), and stores planes as independent objects in DRAM. A compact header records plane IDs and block metadata. Because planes are row-aligned and independent, a read touches only the compressed bytes of the planes required by the request. Internal page activations and burst transfers scale with (i) how many planes are read and (ii) how compressible those planes are. Values remain exact.

**Selective retrieval for dynamic precision.** With plane placement, a runtime precision change is a plane filter. Let the full-precision format have $N_1 = 1 + E + M$ bits (one sign, E exponent, M mantissa). If the host requests a chunk under a reduced floating-point format with $1+r_E+r_M$ bits, CXL-NDP reads exactly the following set of planes from DRAM: $\{P^{\rm sgn}\} \cup \{P^{\rm exp}_{E-r_E}, \ldots, P^{\rm exp}_{E-1}\} \cup \{P^{\rm man}_{M-r_M}, \ldots, P^{\rm man}_{M-1}\}$, and decompresses only those planes on the device. No CXL.mem change is required; precision selection is encoded by the logical address region and is handled entirely inside the CXL-NDP controller.
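The plane filter can be written down directly from the set expression above. In this sketch the plane labels and the example split ($r_E = 4$, $r_M = 3$ over BF16 storage with $E = 8$, $M = 7$) are illustrative choices, not formats mandated by the design.

```python
def planes_to_fetch(E, M, r_E, r_M):
    """Plane labels read for a reduced format with r_E exponent and r_M mantissa bits."""
    if not (0 <= r_E <= E and 0 <= r_M <= M):
        raise ValueError("reduced format must fit inside the stored format")
    exp = [f"exp[{i}]" for i in range(E - r_E, E)]   # most significant exponent planes
    man = [f"man[{i}]" for i in range(M - r_M, M)]   # most significant mantissa planes
    return ["sgn"] + exp + man


def fetch_fraction(E, M, r_E, r_M):
    """Share of stored planes touched by the request (before compression)."""
    return len(planes_to_fetch(E, M, r_E, r_M)) / (1 + E + M)


print(planes_to_fetch(8, 7, 4, 3))
print(fetch_fraction(8, 7, 4, 3))   # 8 of 16 planes -> 0.5
```

The fraction of planes read is an upper bound on DRAM traffic for the request, since each selected plane is additionally stored in compressed form.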

**Optional rounding with guard planes.** By default we truncate. To support round-to-nearest instead of truncation, the controller fetches a small number of *guard* planes in addition to the target planes. Let $d_E, d_M \in \{0, 1, 2\}$ denote how many extra exponent and mantissa guard planes to read (0 means truncate; > 0 enables rounding and carry handling). The controller then fetches $1+(r_E+d_E)+(r_M+d_M)$ planes in total, performs rounding on-device, and emits the exact $(1+r_E+r_M)$-bit result to the host.
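As a rough illustration of the mantissa side of this step, the sketch below counts fetched planes and rounds an M-bit mantissa to $r_M$ bits using $d_M$ guard bits. It is a simplification under stated assumptions: ties round up rather than to even, exponent guard planes are only counted and not modeled, and the function names are hypothetical.

```python
def planes_fetched(r_E, r_M, d_E=0, d_M=0):
    """Total planes read: sign + target planes + guard planes."""
    return 1 + (r_E + d_E) + (r_M + d_M)


def round_mantissa(man, M, r_M, d_M):
    """Reduce an M-bit mantissa to r_M bits using d_M guard bits.

    Returns (mantissa, carry). carry=1 means the mantissa overflowed and the
    exponent must be incremented by the caller. d_M=0 truncates.
    Ties round up here (a simplification of round-to-nearest-even).
    """
    if r_M + d_M > M:
        raise ValueError("target plus guard bits exceed the stored mantissa")
    kept = man >> (M - r_M - d_M)            # bits visible in the fetched planes
    if d_M == 0:
        return kept, 0
    rounded = (kept + (1 << (d_M - 1))) >> d_M
    if rounded == 1 << r_M:                  # all-ones mantissa rounded up
        return 0, 1
    return rounded, 0


print(planes_fetched(4, 3, d_E=0, d_M=1))    # 9 planes instead of 8
print(round_mantissa(0b1011000, 7, 3, 0))    # truncate      -> (5, 0)
print(round_mantissa(0b1011000, 7, 3, 1))    # round up      -> (6, 0)
print(round_mantissa(0b1111000, 7, 3, 1))    # carry out     -> (0, 1)
```

Each guard plane costs one extra plane read, so the rounding mode trades a small amount of bandwidth for lower quantization error.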

## B. Cross-Token KV Cache Clustering and De-correlation

Cross-token KV cache clustering and de-correlation enhance compressibility by grouping KV cache tensors across multiple tokens and organizing them based on positional alignment. For simplicity, we represent each token's KV tensor as a vector, denoted by  $\mathbf{k}_t$  for token t, where each position in  $\mathbf{k}_t$  corresponds to an embedding dimension. By aligning these vectors within a group of n tokens, we create a matrix structure  $G_j$  that captures redundancy more effectively. Grouping KV entries in this way allows for more efficient memory usage by handling data at a group level, improving organization and compressibility.

**Channel-wise grouping across tokens.** As shown in Fig. 7 ①, in each token group G, we organize KV vectors $\mathbf{k}_t$ by aligning entries at the same position across all tokens in the group, thereby forming a matrix structure for each channel. For each channel j, representing a specific entry within each token's KV vector, we collect the entries at channel j across all tokens in the group:

$$G_j = \{k_{t,j} \mid t = 0, \dots, n-1\}. \tag{2}$$

Here, each  $k_{t,j}$  represents the entry at channel j in token t's KV vector  $\mathbf{k}_t$ , and  $G_j$  becomes a row of entries aligned by channel across tokens within the group G. This structure enhances compressibility by aligning similar data elements.
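In software terms, channel-wise grouping is a transpose of the token window. A minimal sketch with made-up values, where each inner list is one token's KV vector:

```python
def to_channel_major(tokens):
    """tokens: n vectors of d channels each -> d rows G_j, each with n entries."""
    return [list(row) for row in zip(*tokens)]


def to_token_major(groups):
    """Inverse transpose, restoring the per-token layout used by the host."""
    return [list(col) for col in zip(*groups)]


# n = 3 tokens, d = 4 channels; values are illustrative only.
window = [
    [0.51, -1.20, 0.03, 7.9],
    [0.55, -1.10, 0.02, 8.1],
    [0.49, -1.30, 0.04, 8.0],
]
groups = to_channel_major(window)
print(groups[0])   # G_0: channel 0 across tokens -> [0.51, 0.55, 0.49]
```

Entries within one row $G_j$ tend to share sign and exponent, which is what the later bit-plane and exponent-delta steps exploit.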



Fig. 7. Illustration of cross-token KV cache clustering and de-correlation for a group of n tokens, showing channel-wise grouping across tokens, bit-plane disaggregation and concatenation and exponent delta transformation.

**Bit-plane disaggregation and concatenation on KV cache.** To enhance compressibility, we begin by organizing each entry in $G_j$ into bit-planes, isolating each bit position across all tokens in a group, as shown in Fig. 7 ②. Each KV entry $k_{t,j}$ is represented as a binary sequence, and the *i*-th bit-plane $P_i(G_j)$ for $G_j$ (structured as a matrix) is:

$$P_i(G_j) = \{ \text{Bit}(k_{t,j}, i) \mid t = 0, \dots, n - 1 \}. \tag{3}$$

Once all bit-planes are extracted, we concatenate these bit-planes across all positions j within each group to form a single bit-plane sequence, as illustrated in Fig. 7:

$$\text{Concatenated\_Bitplane}_i(G) = \bigcup_{j=0}^{J-1} P_i(G_j), \tag{4}$$

where J is the number of positions within the KV vectors.

**Exponent delta transformation.** Following bit-plane disaggregation, we apply an exponent delta transformation to reduce the range of exponent values. For each token group, we identify a base exponent $\beta_j$, defined as the minimum or most common exponent across all tokens for position j. Each exponent in $G_j$ is then transformed relative to $\beta_j$, as shown in Fig. 7 ③. The transformed exponent for each KV entry $k_{t,j}$ is then:

$$\delta_{t,j} = \text{Exponent}(k_{t,j}) - \beta_j. \tag{5}$$

The delta-transformed form of $G_j$ then consists of the base exponent $\beta_j$ and the transformed delta values:

$$G_j = \{\beta_j, \delta_{t,j} \mid t = 0, \dots, n-1\}. \tag{6}$$

Finally, the delta-transformed bit-planes are packed into bytes and compressed, optimizing memory usage and enabling efficient data retrieval during inference.

From a hardware perspective, the memory controller includes a dedicated *channel-wise KV aggregator* module that buffers token embeddings (e.g., a batch of n tokens) and rearranges them so that channel j values appear contiguously. A small integer subtractor computes the exponent delta  $\delta_{t,j}$  relative to  $\beta_j$ , which is stored in a per-channel metadata buffer. The bit-plane shuffle network then disaggregates  $\delta_{t,j}$  and mantissa bits into planes, similar to the model-weight path. Finally, an on-chip block compression engine (LZ4/ZSTD) encodes the rearranged KV data. During reads, the controller reverses these steps by decompressing the bit-planes, restoring exponents via  $\beta_j + \delta_{t,j}$ , and outputting the original per-token KV layout. In practice, additional small header fields (one base exponent per channel) are stored with each block.
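The exponent-delta path can be modeled in a few lines. This sketch assumes BF16 fields obtained by truncating an FP32 encoding and picks the minimum exponent as $\beta_j$ (one of the two options named above), which keeps all deltas non-negative. The sample channel values are made up.

```python
import struct


def bf16_fields(x: float):
    """Split a value into BF16-style (sign, exponent, mantissa) by truncating FP32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0] >> 16
    return bits >> 15, (bits >> 7) & 0xFF, bits & 0x7F


def delta_encode(exponents):
    """Return (beta, deltas) with beta chosen as the channel's minimum exponent."""
    beta = min(exponents)
    return beta, [e - beta for e in exponents]


def delta_decode(beta, deltas):
    """Restore the original exponents as beta + delta."""
    return [beta + d for d in deltas]


channel = [0.51, 0.62, 0.48, 0.75, 0.55]          # channel j across n = 5 tokens
exps = [bf16_fields(v)[1] for v in channel]       # biased 8-bit exponents
beta, deltas = delta_encode(exps)
print("exponents:", exps)                         # [126, 126, 125, 126, 126]
print("beta:", beta, "deltas:", deltas)           # 125, [1, 1, 0, 1, 1]
```

After the transform, the upper delta planes are all zero for this channel, so they compress to almost nothing, and decoding needs only the one stored base exponent.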

**KV windowing and metadata.** We buffer a window of n tokens per head to form channel-major blocks. For each channel we store a 1-byte base exponent; per block we store a 12-byte header, as shown in Table III.

TABLE III KV METADATA PER BLOCK

| Field                | Size | Notes                    |
|----------------------|------|--------------------------|
| Block header         | 12 B | id, planes, codec, sizes |
| Per-channel base exp | 1 B  | $\beta_j$                |

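With d channels and window n, metadata is d+12 bytes per block. For LLaMA 8B with d=128 per head and n=256, metadata is 140 bytes. Each bit-plane of such a window occupies 4 KB (128 × 256 bits), so the 16 BF16 planes hold 64 KB uncompressed and the metadata adds roughly 0.2%. Adjust d and n if your exact shapes differ.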

## C. Precision-Partitioned Logical Address Space

CXL-NDP exposes multiple logical regions over the same physical model image, each region corresponding to a target quantization format. Let L be the total number of weights, s the number of supported formats, and $N_i$ the bit-width of format i (with $N_1$ as full precision). The device stores the $L \cdot N_1$ bits of the model as bit-planes in DRAM. It then exposes s logical regions $\{P_i\}$: (i) Region $P_i$ has a logical size of $L \cdot N_i$ bits. (ii) All regions together span $\sum_{i=1}^s L \cdot N_i$ bits of logical space. (iii) Physically, only $L \cdot N_1$ bits exist (i.e., the full-precision planes).



Fig. 8. Illustration of CXL precision-partitioned logical address space.

**Accessing regions.** A host request for a length-l chunk in format i is a sequential read of  $l \cdot N_i$  bits from  $P_i$ . The controller translates this into fetches of the sign plane plus the most significant  $N_i-1$  exponent/mantissa planes (and optionally a few guard planes for rounding). It then assembles the requested format and returns it over CXL.mem.
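To make the logical-versus-physical accounting concrete, the sketch below lays out regions back to back and decodes a host address into a format and element index. The format set {16, 12, 8, 4} bits and the contiguous, unaligned layout are assumptions of this example; a real device would likely align each region (for instance to huge-page boundaries).

```python
def region_layout(num_weights, format_bits):
    """Return [(base_byte, size_bytes)] per logical region, laid out back to back."""
    layout, base = [], 0
    for bits in format_bits:
        size = num_weights * bits // 8
        layout.append((base, size))
        base += size
    return layout


def decode_address(addr, layout, format_bits):
    """Map a host byte address to (format index, index of the element containing it)."""
    for i, (base, size) in enumerate(layout):
        if base <= addr < base + size:
            return i, (addr - base) * 8 // format_bits[i]
    raise ValueError("address outside all precision regions")


FORMATS = [16, 12, 8, 4]          # N_1..N_s in bits (illustrative)
L = 1 << 20                       # number of weights
layout = region_layout(L, FORMATS)
logical = sum(size for _, size in layout)
physical = L * FORMATS[0] // 8    # only the full-precision planes are stored

print("logical bytes:", logical, "physical bytes:", physical, "ratio:", logical / physical)
print(decode_address(layout[1][0] + 3, layout, FORMATS))   # byte 3 of the 12-bit region
```

Here the host sees 2.5× more address space than the device physically stores, because every reduced-precision region is a view onto the same plane image.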

**Mapping and TLB.** Each region  $P_i$  is exposed as a disjoint physical range in the CXL address space and can be mapped with standard page tables. We recommend 2 MB huge pages per region to avoid TLB pressure. The runtime chooses precision by selecting which region to read. Mixed precision within a layer is implemented by launching multiple DMA streams bound to different regions.
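One way to picture the mapping is a table of disjoint, huge-page-aligned ranges that the front-end searches on each request. The sketch below is illustrative only: the layout policy (regions placed back to back, each rounded up to 2 MB) and the names `build_regions` and `decode` are assumptions, not the device's actual address map.

```python
from bisect import bisect_right

HUGE_PAGE = 2 * 1024 * 1024

def build_regions(base_addr, num_weights, format_bits, align=HUGE_PAGE):
    """Lay out one disjoint range per format. Returns [(start, size_bytes, n_bits)]."""
    regions, cursor = [], base_addr
    for n in format_bits:
        size = (num_weights * n + 7) // 8          # L * N_i bits, in bytes
        regions.append((cursor, size, n))
        cursor += -(-size // align) * align        # round up to the next 2 MB boundary
    return regions

def decode(regions, addr):
    """Map a host physical address to (n_bits, index of the first weight touched)."""
    starts = [r[0] for r in regions]
    i = bisect_right(starts, addr) - 1
    if i < 0:
        raise ValueError("address below the first region")
    start, size, n_bits = regions[i]
    if addr - start >= size:
        raise ValueError("address falls in alignment padding")
    return n_bits, ((addr - start) * 8) // n_bits
```

Because the precision is implied by the address range, the host selects a format simply by reading from a different base pointer, with no change to the CXL.mem request itself.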

Precision regions are carved per device partition. A CXL switch manager assigns region ranges to tenants; there is no cross-tenant sharing of plane objects. Pooling exposes capacity as a sum of region ranges without changing host software.

## D. Controller architecture with device-side lossless compression

CXL-NDP sits between the CXL link and device DRAM and turns precision requests into plane-selective DRAM fetches with on-device lossless (de)compression. Fig. 9 shows the memory controller microarchitecture. The datapath streams at line rate through four stages.

- 1) **Request front-end.** Parses CXL.mem reads/writes, checks the precision-partitioned region, and emits an internal plane request (bit-planes + optional guard planes). A 64-entry MSHR (miss status holding register) tracks outstanding bursts and supports out-of-order reassembly.
- 2) **Plane index and metadata.** Maps each logical chunk to plane IDs, DRAM row ranges, codec tags, and per-block sizes. For KV writes, it also holds the 1-byte base exponent per channel. All metadata sits in a 2 MB SRAM (two banks) with single-cycle lookup.



Fig. 9. CXL-NDP memory controller microarchitecture. The controller exposes precision-partitioned regions to the host, maps requests to bit-planes, and fetches only sign and most-significant exponent/mantissa planes. A plane index in SRAM guides on-device (de)compression and reassembly. KV writes are buffered, transposed to channel-major, and losslessly compressed before DRAM commit. An FR-FCFS scheduler with per-bank plane FIFOs maximizes row locality on DDR5 channels.

- 3) **Codec complex.** A lane group {demux → decompress/compress → mux}. Each lane sustains 512 Gb/s at 2 GHz. We instantiate 32 lanes (2 TB/s aggregate), which is above both device DRAM bandwidth and a PCIe 7.0 x16 link target.
- 4) **DRAM-side scheduler and plane buffers.** Per-bank plane FIFOs (4–8 KB) align block boundaries to rows and merge fetches across adjacent planes. The scheduler is FR-FCFS (first-ready, first-come-first-serve) with starvation caps and favors row hits. KV writes flow through a small window buffer (e.g., n=256 tokens per head) to form channel-major blocks, then enter the codec complex before commit.
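The scheduling policy in stage 4 can be sketched for a single bank as follows. This is a behavioral model under assumptions: the starvation cap is expressed as a maximum number of times the oldest request may be bypassed, and the class name `FrFcfsBank` and its parameters are hypothetical. Timing, plane merging, and row alignment are not modeled.

```python
from collections import deque

class FrFcfsBank:
    """First-ready, first-come-first-serve for one bank, with a starvation cap."""

    def __init__(self, starvation_cap=4):
        self.queue = deque()       # pending (req_id, row), oldest first
        self.open_row = None       # row currently held in the row buffer
        self.bypassed = 0          # consecutive times the oldest request was skipped
        self.cap = starvation_cap

    def push(self, req_id, row):
        self.queue.append((req_id, row))

    def pop(self):
        """Pick the oldest row hit, unless the oldest request has waited too long."""
        if not self.queue:
            return None
        pick = 0                                   # default: plain FCFS
        if self.bypassed < self.cap:
            for i, (_, row) in enumerate(self.queue):
                if row == self.open_row:           # first-ready: row-buffer hit
                    pick = i
                    break
        self.bypassed = self.bypassed + 1 if pick != 0 else 0
        req = self.queue[pick]
        del self.queue[pick]
        self.open_row = req[1]
        return req
```

With `starvation_cap=1`, requests to rows 1, 2, 1, 1 (ids 0 to 3) are served in the order 0, 2, 1, 3: one row hit is allowed to jump ahead of request 1 before the cap forces it to be served.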

**Read pipeline.** For a read to region  $P_i$ , the front-end emits the plane set for format i. The scheduler fetches only those planes. The codec complex decompresses them on the fly, and the reassembler emits IEEE layout in the target format. Guard planes are fetched only when round-to-nearest is requested. Because we touch fewer planes and those planes are compressed, internal DRAM bytes drop along two axes: fewer planes at lower precision and fewer bytes per plane when compression is effective.
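The two axes can be combined in a rough first-order model of internal DRAM traffic. The sketch assumes a single average compression ratio applied uniformly to all fetched planes, which is a simplification since compressibility differs from plane to plane. The function name and parameters are hypothetical.

```python
def dram_bytes(num_weights, target_bits, comp_ratio=1.0, guard_planes=0):
    """Approximate internal DRAM bytes for a read at target_bits per weight.

    comp_ratio is S_orig / S_comp averaged over the fetched planes (assumed uniform).
    """
    planes = target_bits + guard_planes        # axis 1: fewer planes at lower precision
    raw_bytes = num_weights * planes / 8
    return raw_bytes / comp_ratio              # axis 2: fewer bytes per plane

full = dram_bytes(8192, 16)                    # 16 KB chunk at full precision
half = dram_bytes(8192, 8)                     # half the planes, uncompressed
print(full, half)
```

Guard planes appear only as extra fetched planes in this model, matching the statement that they are read only when round-to-nearest is requested.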

**Write pipeline (KV path).** Incoming KV updates are buffered per head and sequence and transposed to channel-major inside the window. Exponent deltas are then computed relative to a 1-byte base exponent per channel, and planes are formed and compressed before DRAM commit. Per-block metadata is a 12-byte header plus 1 byte per channel, typically under 0.1% of payload.

**Timing note.** The codec complex is provisioned at 2 TB/s (32 lanes at 2 GHz), so steady-state throughput is limited by the CXL link or DRAM, not the controller. Bit-plane demux/reassembly is deeply pipelined and overlaps with DRAM access. Under sequential access, plane decompression is fully overlapped with DRAM reads, so CXL-NDP introduces no steady-state time overhead relative to reading raw bytes. Cold start adds tens of nanoseconds, which are typically hidden behind DRAM row activation.

**Batching and KV paging.** CXL-NDP scales cleanly to larger batch sizes. The KV window buffer is provisioned per head and per sequence. With batch sizes above 1, we either form windows per sequence (simplest) or co-pack the same head across multiple sequences into a larger window, which produces longer, contiguous plane bursts and typically improves compression ratio and row-buffer hits. The tradeoff is SRAM for bigger windows. In our prototype, a few megabytes across all heads is enough for batch sizes used in production servers. The design also works with paged KV backends (e.g., vLLM): on a page-in, we read compressed channel-major planes, reconstruct token-major, and return a normal page; on eviction, we capture the writeback stream, rebuild channel-major in the window, and commit compressed bytes. Plane blocks are aligned to the KV page size, and the directory tracks per-page block locations. When KV mostly resides in the CXL tier, more traffic hits compressed, channel-major data, further cutting internal DRAM bytes.

## IV. EXPERIMENT

## A. End-to-end performance

We compare CXL-NDP to a byte-level baseline in a single-GPU setup that serves LLaMA 3.1 70B on an H100. The model is 140 GB and the GPU has 80 GB of HBM, so weights and KV are offloaded to a 256 GB CXL tier. KV pages are 32 tokens.

Fig. 10 shows higher throughput and longer context. At 65k tokens, CXL-NDP is 43% faster in the 5 bit/weight mode because it moves only the required bit-planes instead of full 16-bit words. The baseline exhausts the 256 GB CXL tier at  $\sim\!105k$  tokens and falls back to disk. CXL-NDP extends the usable context by 87% before that cliff.



Fig. 10. End-to-end inference throughput for the 70B model on a single H100 with CXL offload. CXL-NDP scales throughput with requested precision and pushes the context-length limit by compressing KV in the CXL tier.

Fig. 11 breaks down one 16 KB weight-chunk read. Totals are 398 ns (5-bit LZ4), 373 ns (5-bit ZSTD), 718 ns (10-bit LZ4), and 673 ns (10-bit ZSTD). In all cases the path is DRAM-burst bound: codec and reassembly are tiny and fully hidden, with  $t_{\rm decomp} + t_{\rm reasm} = 4.0 + 5.0 = 9.0$  ns (LZ4) and 7.5 + 5.0 = 12.5 ns (ZSTD), while the DRAM burst spans 295–640 ns.



Fig. 11. Weight-read micro-timelines at 5/10 bit/weight with LZ4 and ZSTD. Codec work is hidden relative to the DRAM burst; totals are 398/373 ns (5-bit LZ4/ZSTD) and 718/673 ns (10-bit LZ4/ZSTD).

Fig. 12 shows the KV write for a 32-token window with overlap: transpose  $\rightarrow$  delta  $\rightarrow$  compress  $\rightarrow$  DRAM write. Pipeline totals are 5095 ns (LZ4) and 4976 ns (ZSTD), or 159.2/155.5 ns per token. The pipeline is DRAM-write bound:  $t_{\rm write} = 5733$  ns (LZ4) and 5523 ns (ZSTD), whereas transpose is 1024 ns, delta 51 ns, and compression just 32–60 ns. Codec and reformat time are negligible relative to DRAM.



Fig. 12. KV-write pipeline with planned overlaps. Totals are 5095 ns (LZ4) and 4976 ns (ZSTD); 159.2/155.5 ns per token. DRAM write dominates; transform and compression are small.

## B. Compression Efficiency

1) KV Cache Compressibility: We evaluated the compressibility of the KV cache across the 32 layers of the LLaMA 3.1 8B model on the WikiText [31] and BookSum (LongBench) [26] datasets. We define the compression ratio as  $S_{orig}/S_{comp} \ge 1$, where  $S_{orig}$  and  $S_{comp}$  denote the sizes of the original and compressed data blocks. Compression ratios were measured for both LZ4 and ZSTD with a 4 KB compression block size, as shown in Fig. 13. Our data placement strategy for the KV cache achieves 44.8% and 46.9% overall footprint reduction on WikiText and BookSum, respectively. On WikiText, the highest compression ratios on a single layer reached 2.69 (ZSTD) and 2.31 (LZ4), while on BookSum they peaked at 2.10 (ZSTD) and 1.93 (LZ4). The baseline, which does not apply cross-token KV cache clustering and de-correlation, reaches overall ZSTD compression ratios of 1.21 on WikiText and 1.33 on BookSum. Our approach reaches overall ratios of 1.81 on WikiText and 1.88 on BookSum, improving the overall KV cache lossless compression ratio by 50.3% on WikiText and 41.7% on BookSum using ZSTD.

Fig. 13. KV cache lossless compression ratio by layer ($S_{\text{orig}}/S_{\text{comp}}$, higher is better) for LLaMA 3.1 8B (32 layers) on WikiText and BookSum (LongBench). All results use a bit-plane layout with 4 KB blocks and LZ4/ZSTD. *Proposed*: adds cross-token, channel-wise KV clustering with exponent-delta de-correlation. *Baseline*: bit-plane layout only, without clustering or de-correlation.
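For reference, a compression ratio $S_{orig}/S_{comp}$ converts to a footprint reduction of $1 - 1/\text{ratio}$. The sketch below applies this to the overall ZSTD ratios quoted above. Because those ratios are rounded to two decimals, the results match the reported reductions only to within about a tenth of a percentage point. The function name is hypothetical.

```python
def footprint_reduction(ratio):
    """Fraction of memory saved for a compression ratio S_orig / S_comp."""
    return 1.0 - 1.0 / ratio

for name, ratio in [("WikiText", 1.81), ("BookSum", 1.88)]:
    print(f"{name}: {100 * footprint_reduction(ratio):.1f}% smaller")
```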

2) Model Weights Compressibility: Table IV presents compression ratios and corresponding memory savings for various LLM configurations across different precision levels on model weights. BF16-based models achieve the highest lossless compression gains; for instance, LLaMA 3.1 8B in BF16 precision achieves a ZSTD lossless compression ratio of 1.34, leading to a 25.2% reduction. Moreover, since the proposed method is orthogonal to recent lossy compression techniques, it can be combined with quantization approaches (e.g., GPTQ [8]) to amplify total memory savings. For FP8 and INT4 precision models, combining our method with AutoFP8 and GPTQ lossy compression, starting from BF16, results in substantial overall savings. For example, quantizing LLaMA 3.1 8B to FP8 achieves a 54.1% total reduction, combining a 50% lossy saving with an additional 8.3% from our lossless compression approach.
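The total-savings figures follow from applying the lossless saving to whatever remains after lossy quantization, i.e. $1 - (1 - s_{lossy})(1 - s_{lossless})$. The short sketch below reproduces the LLaMA 3.1 8B entries. The 75% lossy saving for INT4 is inferred here from the 16-bit to 4-bit width change, and the function name is hypothetical.

```python
def total_savings(lossy_saving, lossless_saving):
    """Combined saving when lossless compression is applied after lossy quantization."""
    return 1.0 - (1.0 - lossy_saving) * (1.0 - lossless_saving)

fp8 = total_savings(0.50, 0.083)    # BF16 -> FP8, then 8.3% lossless
int4 = total_savings(0.75, 0.009)   # BF16 -> INT4, then 0.9% lossless
print(f"FP8: {100 * fp8:.2f}%  INT4: {100 * int4:.2f}%")
```

The savings multiply rather than add, which is why 50% plus 8.3% yields roughly 54.1% and not 58.3%.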

Fig. 14 illustrates the compressibility of model weights across bit-planes for BF16, FP8, and INT4-based LLMs using 4 KB ZSTD lossless compression, along with KV cache compressibility in the LLaMA 3.1 8B model on the WikiText and BookSum datasets. For BF16 model weights, the top four exponent bit-planes contribute the most to overall compressibility, and together with the other bit-planes they yield an overall compression ratio of 1.34. This is because exponents, especially in high-precision formats like BF16, often contain more redundancy and fewer unique values, allowing for effective compression. In contrast, FP8 and INT4 models, already subjected to lossy quantization, show limited compressibility because the reduced bit precision minimizes representational redundancy, particularly in the exponent bits.

TABLE IV LOSSLESS COMPRESSION RATIOS AND TOTAL MEMORY SAVINGS WITH LOSSY COMPRESSION ON MODEL WEIGHTS

| Model          | Precision | Comp. Ratio | Lossless Savings | Total Savings |
|----------------|-----------|-------------|------------------|---------------|
| LLaMA 3.1 8B   | BF16      | 1.34        | 25.2%            | 25.2%         |
|                | FP8       | 1.09        | 8.3%             | 54.1%         |
|                | INT4      | 1.01        | 0.9%             | 75.2%         |
| LLaMA 3.1 70B  | BF16      | 1.34        | 25.6%            | 25.6%         |
|                | FP8       | 1.10        | 9.3%             | 54.6%         |
|                | INT4      | 1.02        | 2.1%             | 75.5%         |
| Mixtral 8×7B   | BF16      | 1.32        | 24.4%            | 24.4%         |
|                | FP8       | 1.09        | 8.0%             | 54.1%         |
|                | INT4      | 1.01        | 1.2%             | 75.3%         |
| LLaMA-MoE-3.5B | BF16      | 1.33        | 24.9%            | 24.9%         |
|                | FP8       | 1.11        | 9.9%             | 54.9%         |
|                | INT4      | 1.02        | 1.6%             | 75.4%         |



Fig. 14. Compressibility of model weights and KV cache bit-planes in the LLaMA 3.1 8B model, evaluated for BF16, FP8, and INT4 weight formats and BF16 KV cache on WikiText and BookSum datasets, utilizing ZSTD compression with 4KB blocks.

For the KV cache in BF16 format, shown in the lower two subfigures of Fig. 14, the exponent bit-planes again demonstrate significantly higher compressibility. This is attributed to the relatively narrow range of data stored along each channel in the KV cache, where exponents frequently exhibit low variability across tokens. These properties lead to substantial memory savings of 44.8% on WikiText and 46.9% on BookSum (as shown in Fig. 13).

Fig. 15. Precision distribution for model weights of LLaMA 3.1 8B, LLaMA 3.1 70B, Mixtral 8×7B, and LLaMA-MoE-3.5B in BF16, quantized FP8, and INT4 when conducting inference on WikiText-2.

## C. DRAM Access Efficiency with Dynamic Quantization

We evaluate how device-side plane-aligned fetch translates runtime precision into fewer DRAM bytes across two granularity settings: (i) per-expert dynamic precision in MoE-style models (LLaMA 3.1 8B/70B [32], Mixtral 8×7B [33], and LLaMA-MoE-3.5B [19]), and (ii) per-head and per-neuron dynamic precision in OPT 30B [18]. All DRAM studies use DRAMSim3 [34] with 4 channels per module and  $10\times4$  DDR5-4800 devices per channel.

MoDE models (per-expert). We adapt LLaMA 3.1 8B/70B and Mixtral 8×7B to a Mixture-of-Depth-and-Experts (MoDE) control flow [7] that caps per-block precision and assigns expert precision within that cap. Dense MLPs in LLaMA are converted to MoE to expose expert-level control []. We reduce calibration cost with LoRA [35] on C4, and prepare FP8 and INT4 variants using AutoFP8 [36] and GPTQ [8] on UltraChat [37]. Fig. 15 plots the average runtime precision mix on WikiText-2 across 12 resulting LLMs: BF16 bases sweep BF16/FP12/FP8/FP6/FP4, FP8 bases sweep FP8/FP6/FP4, and INT4 bases sweep INT4/INT2; routers stay in BF16 for accuracy.

We compare our proposed plane-aligned method (denoted P) against a traditional byte-level fetch that always reads full words and post-converts (denoted T). Fig. 16 shows up to 29.9% lower DRAM access energy with P. On BF16 bases, P reduces energy by 27.8% (LLaMA 3.1 8B), 25.9% (LLaMA 3.1 70B), 29.9% (Mixtral 8×7B), and 27.2% (LLaMA-MoE-3.5B). On Mixtral 8×7B with quantized bases, savings remain but taper as intrinsic precision drops: 19.6% for FP8 and 17.9% for INT4. Fig. 17 reports corresponding load-latency gains up to 30.0%. For example, Mixtral 8×7B BF16 falls from 705.90 to 495.06 ms (30.0%), and LLaMA 3.1 70B BF16 drops from 910.58 to 674.73 ms (25.9%); FP8 and INT4 variants also improve (e.g., 348.65→293.27 ms and 251.03→214.11 ms on LLaMA 3.1 70B).

**OPT 30B (per-head and per-neuron).** To validate at finer granularity, we reuse the OPT 30B setup where dynamic precision is chosen per attention head and per MLP neuron. We treat all weights in one head or one neuron as a single chunk, which maps to the device's plane stripes and keeps fetch proportional. Each head contains  $3.7 \times 10^6$  weights and each MLP neuron  $7.2 \times 10^3$  weights. The proposed bit-plane method (equivalent to the plane-aligned P path above) is compared to a traditional word-wise fetch (T). Fig. 18 shows total energy when loading the full model once, with the bit-plane layout reducing DRAM access energy by up to 40.3%. Fig. 19 details per-weight energy: for attention heads at target 1.6/4.8/8.0 bits-per-weight, T costs 49.6/118.9/238.9 pJ while P costs 34.5/70.8/141.2 pJ, i.e., 30.5%, 40.4%, and 40.9% reductions. For MLP neurons, P lowers energy by 19.4%, 20.3%, and 33.9% at the same settings. Latency trends match energy. Fig. 20 shows that per-head load latency falls by 36.2%, 40.6%, and 42.1% at 1.6/4.8/8.0 bits-per-weight, and per-neuron latency drops by 24.8%, 27.9%, and 38.4%.

Fig. 16. **Per-expert granularity:** DRAM access energy per weight for models LLaMA 3.1 8B, LLaMA 3.1 70B, Mixtral 8×7B, and LLaMA-MoE-3.5B in BF16, quantized FP8, and INT4 when conducting inference on WikiText-2. We compare read and activation energy for the proposed bit-plane (P) and traditional byte-level (T) approaches.

Fig. 17. **Per-expert granularity:** Average model load latency for models LLaMA 3.1 8B, LLaMA 3.1 70B, Mixtral 8×7B, and LLaMA-MoE-3.5B in BF16, quantized FP8, and INT4 when conducting inference on WikiText-2.

Across per-expert and per-head/neuron policies, the device-side plane layout turns runtime precision into proportional DRAM work: reads touch only the sign and the most significant exponent and mantissa planes required by the requested format, and lossless per-plane compression further shrinks the bytes moved. The savings are largest when the base model retains wider formats (BF16), and they decline as the model is already compressed to FP8 or INT4, which is expected. The above results show consistent reductions in DRAM energy (up to 40.3%) and load latency (up to 42.1%) while preserving accuracy under dynamic quantization.

Fig. 18. **Per-head and per-neuron granularity:** Comparison of total DRAM access energy when loading the entire OPT 30B model once. The traditional (T) byte-level layout is compared against the proposed (P) bit-plane layout across different average bits/weight targets. The 'Baseline-16' bar shows the energy for a full 16-bit model load.

Fig. 19. **Per-head and per-neuron granularity:** Per-weight DRAM access energy in the OPT 30B model for per-head and per-neuron dynamic quantization granularity, comparing the traditional (T) and proposed (P) methods at (a) attention head granularity and (b) MLP neuron granularity. The stacked bars show the breakdown of read and activation energy for different average bits/weight targets.

Fig. 20. **Per-head and per-neuron granularity:** Average load latency in the OPT 30B model on WikiText across different dynamic quantization settings. The figure compares the traditional (T) byte-level layout with the proposed (P) bit-plane layout for (a) individual attention heads and (b) individual MLP neurons.

## D. Hardware Implementation and Resource Evaluation

We implemented a parameterizable RTL design of our bit-plane-aware compression subsystem in SystemVerilog. The design includes three main modules: (1) a bit-plane aggregator for shuffling data into and out of bit-plane format, (2) a multi-lane compression engine supporting both LZ4 and ZSTD, and (3) control logic for managing block-based I/O and interfacing with the main memory controller.

Table V summarizes the hardware resource usage, synthesized for a 7 nm process technology [38] at a target frequency of 2 GHz. Each of the 32 parallel lanes is designed to process data at 512 Gb/s, yielding an aggregate *internal* processing bandwidth of 2 TB/s (16384 Gb/s). This internal bandwidth is intentionally overprovisioned to saturate a high-speed external CXL link (e.g., 256 GB/s for a PCIe 7.0 x16 link). This design ensures that the on-device processing pipeline does not become a bottleneck, even under worst-case, low-compressibility workloads where the hardware must still process data at the full line rate.
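The provisioning arithmetic can be checked directly. The sketch below converts the per-lane rate into an aggregate byte rate and compares it with the 256 GB/s link figure quoted above. The function name and the headroom calculation are illustrative additions.

```python
def aggregate_gbytes_per_s(lanes=32, lane_gbits_per_s=512):
    """Aggregate codec throughput in GB/s from per-lane throughput in Gb/s."""
    return lanes * lane_gbits_per_s / 8

codec = aggregate_gbytes_per_s()        # 32 * 512 Gb/s = 16384 Gb/s
link = 256                              # GB/s, PCIe 7.0 x16 figure from the text
print(codec, codec / link)
```

Under these figures the codec has roughly 8 times the bandwidth of the external link, so even incompressible data can be processed without the codec becoming the limiting stage.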

The silicon cost for this level of performance is modest. For an LZ4-based configuration with a 65536-bit block size, the 32-lane subsystem occupies 4.83 mm<sup>2</sup> and consumes 5.25 W. The more complex ZSTD engine requires 5.69 mm<sup>2</sup> and 7.38 W under the same configuration.

## V. RELATED WORK

Bit-Plane Disaggregation: The use of bit-plane disaggregation for improving data compressibility and hardware efficiency has been explored extensively in both classical and contemporary literature. Early works such as BPC [39] by NVIDIA highlight that regrouping bits in a bit-plane manner can notably enhance the compressibility of uniformly typed data blocks, which offers a more hardware-friendly implementation compared to traditional byte- or word-oriented layouts. EBPC [40] applied an extended bit-plane compression scheme to deep neural network accelerators, demonstrating higher compression ratios by taking advantage of bit-level redundancy within activations. More recent approaches extend these concepts specifically to LLMs and other deep networks with mixed-precision arithmetic. For example, [41] organizes quantized parameters into distinct bit-planes to facilitate "any-precision" execution, loading only the necessary precision based on dynamic accuracy-speed trade-offs. In a similar vein, SmartQuant [42] uses bit-plane techniques to store LLM weights in a partially quantized format, dynamically retrieving different subsets of bit-planes according to the context's numerical requirements.

TABLE V SILICON COST AT 2 GHz WITH 32 LANES FOR LZ4 AND ZSTD LOSSLESS COMPRESSION. ("SL" DENOTES SINGLE-LANE.)

| Engine | BlockSize<br>(bits) | SL Area (mm <sup>2</sup> ) | SL Power<br>(mW) | LaneTotArea (mm²) | LaneTotPower (mW) | SL Thpt<br>(Gb/s) |
|--------|---------------------|----------------------------|------------------|-------------------|-------------------|-------------------|
| LZ4    | 16384               | 0.05669                    | 696.515          | 1.81413           | 2228.846          | 512               |
| LZ4    | 32768               | 0.07557                    | 885.258          | 2.41811           | 2832.826          | 512               |
| LZ4    | 65536               | 0.15106                    | 1640.233         | 4.83403           | 5248.745          | 512               |
| ZSTD   | 16384               | 0.08357                    | 1363.715         | 2.67429           | 4363.886          | 512               |
| ZSTD   | 32768               | 0.10245                    | 1552.458         | 3.27827           | 4967.866          | 512               |
| ZSTD   | 65536               | 0.17794                    | 2307.433         | 5.69419           | 7384.785          | 512               |

Compression in LLMs: Post-training quantization approaches, such as GPTQ [8] and AWQ [43], have been developed to convert model weights and activations to lower bit-width representations, reducing model size and inference latency. SparseGPT [44] applies one-shot pruning to LLMs, reducing model size and computational requirements. SpQR [45] explores sparse quantization, combining pruning and quantization to enhance efficiency.

Contextual Importance in LLMs: Deja Vu [6] is a framework that predicts contextual sparsity on-the-fly for each input. PowerInfer [46] and LLM in a Flash [16] extend this work to hybrid CPU/GPU and flash platforms. MoE and MoD [7] dynamically adjust the experts and depth used for different tokens or layers within a model based on the input tokens.

Lossless Memory Compression: Multiple hardware-based main memory compression schemes have been proposed [47]–[50]. Key-value store systems such as ZipCache [51] and ZipKV [52] apply memory block compression to the DRAM tier to reduce the memory footprint. Contemporary data analytics systems such as SAP HANA [53], Oracle [54], and Snowflake [55] apply block compression to reduce their memory consumption.

## VI. CONCLUSION

We presented *CXL-NDP*, a transparent near-data architecture that raises effective bandwidth in CXL Type-3 memory without changing CXL.mem or application code. The design does two concrete things. First, a precision-scalable bit-plane layout serves only the planes needed for the target format (for example FP12, FP8, FP4), so DRAM work scales with runtime precision. Second, device-side lossless compression, paired with channel-wise KV reordering and exponent-delta metadata, cuts internal transfers while keeping values exact.

On public LLMs, weights shrink by 25.2% on average and the KV cache by 44.8–46.9%. In DRAMSim3, precision-proportional fetch reduces DRAM access energy by up to 40.3% and load latency by up to 42.1%. In an end-to-end setup that serves LLaMA 3.1 70B on an H100 with CXL offload, CXL-NDP improves throughput by 43% at a 65k token context and pushes past the 105k-token limit that breaks a passive CXL tier. An RTL implementation synthesized in a 7 nm process sustains 2 TB/s at 2 GHz with modest area and power: 4.83 mm² and 5.25 W for an LZ4 configuration, or 5.69 mm² and 7.38 W for ZSTD.

To fully utilize CXL memory for LLM inference, we must both make the most of the bandwidth we already have on the CXL link and DDR channels and raise the *effective* bandwidth seen by the accelerator. CXL-NDP does both. Bit-plane-aligned fetch avoids moving unnecessary bits, and lossless compression cuts internal bytes (about 25% for weights and 45% for KV). It works under unmodified CXL.mem and composes with standard quantization. The same controller is a practical anchor for next steps such as in-device prefetch, KV page scheduling, and cache-aware placement to keep the link saturated with useful data.

## REFERENCES

- [1] NVIDIA, "Nvidia h200 tensor core gpu," 2023. [Online]. Available: https://www.nvidia.com/en-us/data-center/h200/
- [2] D. Patel and A. Ahmad, "The memory wall: Past, present, and future of dram," SemiAnalysis, Sep 2024, discusses HBM costing ~3× DDR5 per GB.
- [3] CXL Consortium and ABI Research, "Cxl 101: Opportunities and challenges for compute express link," https://computeexpresslink.org/wp-content/uploads/2024/11/CR-CXL-101\_FINAL.pdf, 2024, models ~52-55% lower cost/GB with CXL memory expansion and reduced stranded memory.
- [4] PCI-SIG, "PCI Express 7.0 Specification, Version 0.5 Now Available," https://pcisig.com/blog/pcie-70-specification-version-05-now-available-full-draft-available-members, 2024, accessed: August 18, 2025.
- [5] JEDEC, "JEDEC Releases New LPDDR6 Standard to Enhance Mobile and AI Memory Performance," 2024, https://www.jedec.org/news/pressreleases/jedec-releases-new-lpddr6standard-enhance-mobile-and-ai-memory-performance, Accessed: August 18, 2025.
- [6] Z. Liu, J. Wang, T. Dao, T. Zhou, B. Yuan, Z. Song, A. Shrivastava, C. Zhang, Y. Tian, C. Re, and B. Chen, "Deja vu: Contextual sparsity for efficient llms at inference time," PMLR, pp. 22 137–22 176, 2023.
- [7] D. Raposo, S. Ritter, B. Richards, T. Lillicrap, P. C. Humphreys, and A. Santoro, "Mixture-of-depths: Dynamically allocating compute in transformer-based language models," arXiv preprint arXiv:2404.02258, 2024.
- [8] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, "Gptq: Accurate post-training quantization for generative pre-trained transformers," arXiv preprint arXiv:2210.17323, 2022.
- [9] J. Tang, Y. Zhao, K. Zhu, G. Xiao, B. Kasikci, and S. Han, "Quest: Query-aware sparsity for efficient long-context llm inference," arXiv preprint arXiv:2406.10774, 2024.
- [10] LZ4. Lz4 compression algorithm. https://lz4.github.io/lz4/.
- [11] Zstandard (zstd). Facebook. https://github.com/facebook/zstd.
- [12] D. D. Sharma, R. Blankenship, and D. S. Berger, "An introduction to the compute express link (CXL) interconnect," arXiv preprint arXiv:2306.11227, 2023.
- [13] Y. Zhong, D. S. Berger, C. Waldspurger, R. Wee, I. Agarwal, R. Agarwal, F. Hady, K. Kumar, M. D. Hill, M. Chowdhury et al., "Managing memory tiers with {CXL} in virtualized environments," in 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 2024, pp. 37–56.
- [14] Meta-AI, "Llama 3.1: A collection of multilingual large language models," 2024, accessed: 2025-03-22. [Online]. Available: https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct

- [15] DeepSeek-AI, "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning," 2025, accessed: 2025-03-22. [Online]. Available: https://huggingface.co/deepseek-ai/DeepSeek-R1
- [16] K. Alizadeh, I. Mirzadeh, D. Belenko, K. Khatamifard, M. Cho, C. C. Del Mundo, M. Rastegari, and M. Farajtabar, "Llm in a flash: Efficient large language model inference with limited memory," arXiv preprint arXiv:2312.11514, 2023.
- [17] W. Wang, W. Chen, Y. Luo, Y. Long, Z. Lin, L. Zhang, B. Lin, D. Cai, and X. He, "Model compression and efficient inference for large language models: A survey," arXiv preprint arXiv:2402.09748, 2024.
- [18] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer, "OPT: Open pre-trained transformer language models," arXiv preprint arXiv:2205.01068, 2022.
- [19] T. Zhu, X. Qu, D. Dong, J. Ruan, J. Tong, C. He, and Y. Cheng, "Llama-moe: Building mixture-of-experts from llama with continual pretraining," arXiv preprint arXiv:2406.16554, 2024.
- [20] Y. Bisk, R. Zellers, J. Gao, and Y. Choi, "Piqa: Reasoning about physical commonsense in natural language," in *Proceedings of the AAAI Conference on Artificial Intelligence*, vol. 34, no. 05, 2020, pp. 7432–7439.
- [21] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi, "Winogrande: An adversarial winograd schema challenge at scale," *Communications of the ACM*, vol. 64, no. 9, pp. 99–106, 2021.
- [22] D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández, "The lambada dataset: Word prediction requiring a broad discourse context," arXiv preprint arXiv:1606.06031, 2016.
- [23] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, "Measuring massive multitask language understanding," arXiv preprint arXiv:2009.03300, 2020.
- [24] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," *Journal of Machine Learning Research*, vol. 21, no. 140, pp. 1–67, 2020.
- [25] S. Merity, C. Xiong, J. Bradbury, and R. Socher, "Pointer sentinel mixture models," arXiv preprint arXiv:1609.07843, 2016.
- [26] W. Kryściński, N. Rajani, D. Agarwal, C. Xiong, and D. Radev, "Booksum: A collection of datasets for long-form narrative summarization," arXiv preprint arXiv:2105.08209, 2021.
- [27] NVIDIA Corporation, "NVIDIA GB200 NVL72," https://www.nvidia.com/en-us/data-center/gb200-nvl72/, 2024, accessed: August 20, 2024.
- [28] "NVIDIA DGX B200," 2024, https://www.nvidia.com/en-us/data-center/dgx-b200/, Accessed: August 20, 2024.
- [29] Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu, "Kivi: A tuning-free asymmetric 2bit quantization for kv cache," arXiv preprint arXiv:2402.02750, 2024.
- [30] T. Zhang, J. Yi, Z. Xu, and A. Shrivastava, "Kv cache is 1 bit per channel: Efficient large language model inference with coupled quantization," *Advances in Neural Information Processing Systems*, vol. 37, pp. 3304–3331, 2024.
- [31] WikiText language modeling dataset. https://huggingface.co/datasets/Salesforce/wikitext.
- [32] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., "The llama 3 herd of models," arXiv preprint arXiv:2407.21783, 2024.
- [33] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, "Mixtral of experts," arXiv preprint arXiv:2401.04088, 2024.
- [34] S. Li, Z. Yang, D. Reddy, A. Srivastava, and B. Jacob, "Dramsim3: A cycle-accurate, thermal-capable dram simulator," *IEEE Computer Architecture Letters*, vol. 19, no. 2, pp. 106–109, 2020.
- [35] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "Lora: Low-rank adaptation of large language models," arXiv preprint arXiv:2106.09685, 2021.
- [36] AutoFP8: A Framework for Automatic Mixed-Precision Quantization. https://github.com/neuralmagic/AutoFP8.
- [37] N. Ding, Y. Chen, B. Xu, Y. Qin, Z. Zheng, S. Hu, Z. Liu, M. Sun, and B. Zhou, "Enhancing chat language models by scaling high-quality instructional conversations," arXiv preprint arXiv:2305.14233, 2023.
- [38] L. T. Clark, V. Vashishtha, L. Shifren, A. Gujja, S. Sinha, B. Cline, C. Ramamurthy, and G. Yeric, "Asap7: A 7-nm finfet predictive process design kit," *Microelectronics Journal*, vol. 53, pp. 105–115, 2016.

- [39] J. Kim, M. Sullivan, E. Choukse, and M. Erez, "Bit-plane compression: Transforming data for better compression in many-core architectures," *ACM SIGARCH Computer Architecture News*, vol. 44, no. 3, pp. 329–340, 2016.
- [40] L. Cavigelli, G. Rutishauser, and L. Benini, "Ebpc: Extended bitplane compression for deep neural network inference and training accelerators," 2019. [Online]. Available: https://arxiv.org/abs/1908.11645
- [41] Y. Park, J. Hyun, S. Cho, B. Sim, and J. W. Lee, "Any-precision llm: Low-cost deployment of multiple, different-sized llms," arXiv preprint arXiv:2402.10517, 2024.
- [42] R. Xie, A. U. Haq, L. Ma, K. Sun, S. Sen, S. Venkataramani, L. Liu, and T. Zhang, "Smartquant: Cxl-based ai model store in support of runtime configurable weight quantization," *IEEE Computer Architecture Letters*, 2024.
- [43] J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, "Awq: Activation-aware weight quantization for on-device llm compression and acceleration," *Proceedings of Machine Learning and Systems*, vol. 6, pp. 87–100, 2024.
- [44] E. Frantar and D. Alistarh, "Sparsegpt: Massive language models can be accurately pruned in one-shot," in *International Conference on Machine Learning*. PMLR, 2023, pp. 10323–10337.
- [45] T. Dettmers, R. Svirschevski, V. Egiazarian, D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov, T. Hoefler, and D. Alistarh, "Spqr: A sparse-quantized representation for near-lossless llm weight compression," arXiv preprint arXiv:2306.03078, 2023.
- [46] Y. Song, Z. Mi, H. Xie, and H. Chen, "Powerinfer: Fast large language model serving with a consumer-grade gpu," arXiv preprint arXiv:2312.12456, 2023.
- [47] E. Choukse, M. Erez, and A. R. Alameldeen, "Compresso: Pragmatic main memory compression," in *51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)*. IEEE, 2018, pp. 546–558.
- [48] M. Ekman and P. Stenstrom, "A robust main-memory compression scheme," in *32nd International Symposium on Computer Architecture (ISCA)*. IEEE, 2005, pp. 74–85.
- [49] G. Pekhimenko, V. Seshadri, Y. Kim, H. Xin, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, "Linearly compressed pages: A low-complexity, low-latency main memory compression framework," in *Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture*, 2013, pp. 172–184.
- [50] J. Zhao, S. Li, J. Chang, J. L. Byrne, L. L. Ramirez, K. Lim, Y. Xie, and P. Faraboschi, "Buri: Scaling big-memory computing with hardware-based memory expansion," *ACM Transactions on Architecture and Code Optimization (TACO)*, vol. 12, no. 3, pp. 1–24, 2015.
- [51] R. Xie, L. Ma, A. Zhong, F. Chen, and T. Zhang, "Zipcache: A dram/ssd cache with built-in transparent compression," in *Proceedings of the 10th International Symposium on Memory Systems (MEMSYS)*, 2024.
- [52] L. Ma, R. Xie, and T. Zhang, "Zipkv: In-memory key-value store with built-in data compression," in *Proceedings of the 2023 ACM SIGPLAN International Symposium on Memory Management*, 2023, pp. 150–162.
- [53] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner, "Sap hana database: data management for modern business applications," *ACM SIGMOD Record*, vol. 40, no. 4, pp. 45–51, 2012.
- [54] T. Lahiri, S. Chavan, M. Colgan, D. Das, A. Ganesh, M. Gleeson, S. Hase, A. Holloway, J. Kamp, T.-H. Lee, J. Loaiza, N. Macnaughton, V. Marwah, N. Mukherjee, A. Mullick, S. Muthulingam, V. Raja, M. Roth, E. Soylemez, and M. Zait, "Oracle database in-memory: A dual format in-memory database," in *IEEE International Conference on Data Engineering (ICDE)*, 2015, pp. 1253–1258.
- [55] B. Dageville, T. Cruanes, M. Zukowski, V. Antonov, A. Avanes, J. Bock, J. Claybaugh, D. Engovatov, M. Hentschel, J. Huang, A. W. Lee, A. Motivala, A. Q. Munir, S. Pelley, P. Povinec, G. Rahn, S. Triantafyllis, and P. Unterbrunner, "The snowflake elastic data warehouse," in *Proceedings of the International Conference on Management of Data (SIGMOD)*, 2016, pp. 215–226.

**Rui Xie** is a Ph.D. candidate in the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute, Troy, NY, USA. His research focuses on efficient memory system design for large language models. He received the B.E. degree in Microelectronics Science and Engineering from Southern University of Science and Technology, China.

**Asad Ul Haq** is currently pursuing the Ph.D. degree in the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute, Troy, NY, USA. His research interest is in computational CXL memory. He received the B.S. degree in Electrical Engineering from the National University of Sciences and Technology, Pakistan.

**Linsen Ma** received the Ph.D. degree in 2025 from the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute, Troy, NY, USA. His doctoral research focused on efficient data management over computational storage. He received the M.S. and B.S. degrees in Electrical Engineering from Rensselaer Polytechnic Institute.

**Yunhua Fang** is currently pursuing the Ph.D. degree in the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute, Troy, NY, USA. His research focuses on memory system design for AI-centric infrastructure. He received his B.S. degree in Computer Science from the University of California, Davis.

**Zirak Burzin Engineer** is a student at Wiseburn Da Vinci Science High School, El Segundo, CA, USA. His interests include computer architecture and machine learning. He contributed to this research through a summer program focused on high-performance computing.

**Liu Liu** is an Assistant Professor in the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute, Troy, NY, USA. His research interests include elastic AI computing systems and architecture design. He received the B.S. degree from the University of Electronic Science and Technology of China, the M.S. degree in electrical and computer engineering, and the Ph.D. degree in computer science, both from the University of California, Santa Barbara.

**Tong Zhang** (Fellow, IEEE) is a Professor in the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute, Troy, NY, USA. His current research focuses on computer system design, with an emphasis on memory-centric computing for AI and data science. He received the B.S. and M.S. degrees in electrical engineering from Xi'an Jiaotong University, China, and the Ph.D. degree in electrical and computer engineering from the University of Minnesota, Minneapolis.